Keras: Deep Learning library for Theano and TensorFlow

Keras is a minimalist, highly modular neural networks library, written in Python and capable of running on top of either TensorFlow or Theano. It was developed with a focus on enabling fast experimentation. Being able to go from idea to result with the least possible delay is key to doing good research. ref: https://keras.io/

Why this name, Keras?

Keras (κέρας) means horn in Greek. It is a reference to a literary image from ancient Greek and Latin literature, first found in the Odyssey, where dream spirits (Oneiroi, singular Oneiros) are divided between those who deceive men with false visions, who arrive to Earth through a gate of ivory, and those who announce a future that will come to pass, who arrive through a gate of horn. It's a play on the words κέρας (horn) / κραίνω (fulfill), and ἐλέφας (ivory) / ἐλεφαίρομαι (deceive).

Keras was initially developed as part of the research effort of project ONEIROS (Open-ended Neuro-Electronic Intelligent Robot Operating System).

"Oneiroi are beyond our unravelling --who can be sure what tale they tell? Not all that men look for comes to pass. Two gates there are that give passage to fleeting Oneiroi; one is made of horn, one of ivory. The Oneiroi that pass through sawn ivory are deceitful, bearing a message that will not be fulfilled; those that come out through polished horn have truth behind them, to be accomplished for men who see them." Homer, Odyssey 19. 562 ff (Shewring translation).

Kaggle Challenge Data (again)

See: Data Description


In [1]:
from kaggle_data import load_data, preprocess_data, preprocess_labels

X_train, labels = load_data('data/kaggle_ottogroup/train.csv', train=True)
X_train, scaler = preprocess_data(X_train)
Y_train, encoder = preprocess_labels(labels)

X_test, ids = load_data('data/kaggle_ottogroup/test.csv', train=False)

X_test, _ = preprocess_data(X_test, scaler)

nb_classes = Y_train.shape[1]
print(nb_classes, 'classes')

dims = X_train.shape[1]
print(dims, 'dims')


Using TensorFlow backend.
9 classes
93 dims

Hands On - Keras Logistic Regression


In [2]:
from keras.models import Sequential
from keras.layers import Dense, Activation

In [5]:
dims = X_train.shape[1]
print(dims, 'dims')
print("Building model...")

nb_classes = Y_train.shape[1]
print(nb_classes, 'classes')

model = Sequential()
model.add(Dense(nb_classes, input_shape=(dims,)))
model.add(Activation('softmax'))
model.compile(optimizer='sgd', loss='categorical_crossentropy')
model.fit(X_train, Y_train)


93 dims
Building model...
9 classes
Epoch 1/10
61878/61878 [==============================] - 2s - loss: 1.0577     
Epoch 2/10
61878/61878 [==============================] - 3s - loss: 0.7702     
Epoch 3/10
61878/61878 [==============================] - 2s - loss: 0.7283     
Epoch 4/10
61878/61878 [==============================] - 2s - loss: 0.7075     
Epoch 5/10
61878/61878 [==============================] - 2s - loss: 0.6945     
Epoch 6/10
61878/61878 [==============================] - 2s - loss: 0.6854     
Epoch 7/10
61878/61878 [==============================] - 2s - loss: 0.6787     
Epoch 8/10
61878/61878 [==============================] - 2s - loss: 0.6735     
Epoch 9/10
61878/61878 [==============================] - 2s - loss: 0.6694     
Epoch 10/10
61878/61878 [==============================] - 2s - loss: 0.6659     
Out[5]:
<keras.callbacks.History at 0x1031e4e80>

Simplicity is pretty impressive, right? :)

Now let's understand what we just did:

The core data structure of Keras is a model, a way to organize layers. The main type of model is the Sequential model, a linear stack of layers.

What we did here was stack a fully connected (Dense) layer of trainable weights from the input to the output, with an Activation layer on top of that weights layer.

Dense
from keras.layers.core import Dense

Dense(units, activation=None, use_bias=True, 
      kernel_initializer='glorot_uniform', bias_initializer='zeros', 
      kernel_regularizer=None, bias_regularizer=None, 
      activity_regularizer=None, kernel_constraint=None, bias_constraint=None)
  • units: positive integer, dimensionality of the output space.

  • activation: name of the activation function to use (see activations), or alternatively an element-wise function. If you don't specify anything, no activation is applied (i.e. "linear" activation: a(x) = x).

  • use_bias: whether the layer uses a bias vector (i.e. whether the layer is affine rather than purely linear).

  • kernel_initializer: name of the initialization function for the kernel weights matrix (see initializers).

  • bias_initializer: name of the initialization function for the bias vector.

  • kernel_regularizer: regularizer applied to the kernel weights matrix (e.g. L1 or L2 regularization).

  • bias_regularizer: regularizer applied to the bias vector.

  • activity_regularizer: regularizer applied to the output of the layer (its "activation").

  • kernel_constraint: instance of the constraints module (e.g. max_norm, non_neg), applied to the kernel weights matrix.

  • bias_constraint: instance of the constraints module, applied to the bias vector.
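
Putting a few of these arguments together, here is a minimal sketch (the layer size and the L2 strength are illustrative choices, not values used in this notebook):

from keras.layers import Dense
from keras.regularizers import l2

# a 64-unit fully connected layer with ReLU activation and L2 weight decay
layer = Dense(64, activation='relu',
              kernel_initializer='glorot_uniform',
              kernel_regularizer=l2(0.01))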

(some) other layers in keras.layers.core

  • keras.layers.core.Flatten()
  • keras.layers.core.Reshape(target_shape)
  • keras.layers.core.Permute(dims)
model = Sequential()
model.add(Permute((2, 1), input_shape=(10, 64)))
# now: model.output_shape == (None, 64, 10)
# note: `None` is the batch dimension
  • keras.layers.core.Lambda(function, output_shape=None, arguments=None)
  • keras.layers.core.ActivityRegularization(l1=0.0, l2=0.0)
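
For instance, the Reshape and Lambda layers listed above can be sketched as follows (the shapes and the wrapped function are illustrative only):

from keras.models import Sequential
from keras.layers import Reshape, Lambda

shape_demo = Sequential()
shape_demo.add(Reshape((8, 8), input_shape=(64,)))  # now: shape_demo.output_shape == (None, 8, 8)
shape_demo.add(Lambda(lambda x: x ** 2))            # apply an arbitrary element-wise function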

Credits: Yam Peleg (@Yampeleg)

Activation
from keras.layers.core import Activation

Activation(activation)

Supported Activations: [https://keras.io/activations/]

Advanced Activations: [https://keras.io/layers/advanced-activations/]
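
For example, a standard activation is added by name, while advanced activations are layers in their own right (a minimal sketch; the layer sizes and alpha value are illustrative choices):

from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers.advanced_activations import LeakyReLU

example = Sequential()
example.add(Dense(64, input_shape=(dims,)))
example.add(Activation('tanh'))     # built-in activation, referenced by name
example.add(Dense(64))
example.add(LeakyReLU(alpha=0.1))   # advanced activation, added as a layer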

Optimizer

If you need to, you can further configure your optimizer. A core principle of Keras is to make things reasonably simple, while allowing the user to be fully in control when they need to (the ultimate control being the easy extensibility of the source code). Here we used SGD (stochastic gradient descent) as an optimization algorithm for our trainable weights.
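
For instance, instead of passing the string 'sgd' you can instantiate and configure the optimizer yourself (a sketch; the learning rate and momentum values are illustrative, not tuned):

from keras.optimizers import SGD

sgd = SGD(lr=0.01, momentum=0.9, nesterov=True)
model.compile(optimizer=sgd, loss='categorical_crossentropy')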

"Data Sciencing" this example a little bit more

What we did here is nice; however, in the real world it is not usable as-is because of overfitting. Let's try to address that with a held-out validation set and early stopping.

Overfitting

In overfitting, a statistical model describes random error or noise instead of the underlying relationship. Overfitting occurs when a model is excessively complex, such as having too many parameters relative to the number of observations.

A model that has been overfit has poor predictive performance, as it overreacts to minor fluctuations in the training data.

To guard against overfitting, we will first split our data into a training set and a validation set and monitor our model on the validation set.
Next, we will use two of Keras's callbacks: EarlyStopping and ModelCheckpoint.

Let's first look at the model we implemented


In [7]:
model.summary()


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_2 (Dense)              (None, 9)                 846       
_________________________________________________________________
activation_2 (Activation)    (None, 9)                 0         
=================================================================
Total params: 846.0
Trainable params: 846.0
Non-trainable params: 0.0
_________________________________________________________________
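
A quick sanity check on the parameter count: the Dense layer maps the 93 input features to the 9 classes, so it has 93 × 9 = 837 weights plus 9 biases, i.e. 846 parameters, which matches the summary above.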

In [8]:
from sklearn.model_selection import train_test_split
from keras.callbacks import EarlyStopping, ModelCheckpoint

In [11]:
X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train, test_size=0.15, random_state=42)

fBestModel = 'best_model.h5' 
early_stop = EarlyStopping(monitor='val_loss', patience=4, verbose=1) 
best_model = ModelCheckpoint(fBestModel, verbose=0, save_best_only=True)
model.fit(X_train, Y_train, validation_data = (X_val, Y_val), epochs=20, 
          batch_size=128, verbose=True, callbacks=[best_model, early_stop])


Train on 52596 samples, validate on 9282 samples
Epoch 1/20
52596/52596 [==============================] - 0s - loss: 0.6634 - val_loss: 0.6561
Epoch 2/20
52596/52596 [==============================] - 0s - loss: 0.6626 - val_loss: 0.6562
Epoch 3/20
52596/52596 [==============================] - 0s - loss: 0.6619 - val_loss: 0.6562
Epoch 4/20
52596/52596 [==============================] - 0s - loss: 0.6613 - val_loss: 0.6562
Epoch 5/20
52596/52596 [==============================] - 0s - loss: 0.6607 - val_loss: 0.6561
Epoch 6/20
52596/52596 [==============================] - 0s - loss: 0.6601 - val_loss: 0.6557
Epoch 7/20
52596/52596 [==============================] - 0s - loss: 0.6596 - val_loss: 0.6554
Epoch 8/20
52596/52596 [==============================] - 0s - loss: 0.6591 - val_loss: 0.6551
Epoch 9/20
52596/52596 [==============================] - 0s - loss: 0.6586 - val_loss: 0.6550
Epoch 10/20
52596/52596 [==============================] - 0s - loss: 0.6582 - val_loss: 0.6548
Epoch 11/20
52596/52596 [==============================] - 0s - loss: 0.6577 - val_loss: 0.6545
Epoch 12/20
52596/52596 [==============================] - 0s - loss: 0.6572 - val_loss: 0.6544
Epoch 13/20
52596/52596 [==============================] - 0s - loss: 0.6568 - val_loss: 0.6541
Epoch 14/20
52596/52596 [==============================] - 1s - loss: 0.6563 - val_loss: 0.6538
Epoch 15/20
52596/52596 [==============================] - 0s - loss: 0.6559 - val_loss: 0.6534
Epoch 16/20
52596/52596 [==============================] - 1s - loss: 0.6555 - val_loss: 0.6533
Epoch 17/20
52596/52596 [==============================] - 0s - loss: 0.6551 - val_loss: 0.6534
Epoch 18/20
52596/52596 [==============================] - 0s - loss: 0.6548 - val_loss: 0.6529
Epoch 19/20
52596/52596 [==============================] - 0s - loss: 0.6544 - val_loss: 0.6525
Epoch 20/20
52596/52596 [==============================] - 0s - loss: 0.6540 - val_loss: 0.6523
Out[11]:
<keras.callbacks.History at 0x11835d048>
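
Since ModelCheckpoint saves the full model by default, the best-performing epoch can be reloaded later from best_model.h5; a minimal sketch:

from keras.models import load_model

restored = load_model('best_model.h5')   # reload the checkpointed model
val_preds = restored.predict(X_val)      # and use it, e.g. for predictions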

Multi-Layer Perceptron and Fully Connected

So, how hard can it be to build a Multi-Layer Perceptron with Keras? It is basically the same: just add more layers!


In [12]:
model = Sequential()
model.add(Dense(100, input_shape=(dims,)))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))
model.compile(optimizer='sgd', loss='categorical_crossentropy')
model.summary()


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_3 (Dense)              (None, 100)               9400      
_________________________________________________________________
dense_4 (Dense)              (None, 9)                 909       
_________________________________________________________________
activation_3 (Activation)    (None, 9)                 0         
=================================================================
Total params: 10,309.0
Trainable params: 10,309.0
Non-trainable params: 0.0
_________________________________________________________________
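
As before, the parameter counts add up: the first Dense layer has 93 × 100 + 100 = 9,400 parameters, the second has 100 × 9 + 9 = 909, for a total of 10,309.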

In [13]:
model.fit(X_train, Y_train, validation_data = (X_val, Y_val), epochs=20, 
          batch_size=128, verbose=True)


Train on 52596 samples, validate on 9282 samples
Epoch 1/20
52596/52596 [==============================] - 1s - loss: 1.2076 - val_loss: 0.8897
Epoch 2/20
52596/52596 [==============================] - 1s - loss: 0.8247 - val_loss: 0.7779
Epoch 3/20
52596/52596 [==============================] - 0s - loss: 0.7595 - val_loss: 0.7378
Epoch 4/20
52596/52596 [==============================] - 1s - loss: 0.7289 - val_loss: 0.7153
Epoch 5/20
52596/52596 [==============================] - 1s - loss: 0.7101 - val_loss: 0.7008
Epoch 6/20
52596/52596 [==============================] - 1s - loss: 0.6973 - val_loss: 0.6903
Epoch 7/20
52596/52596 [==============================] - 1s - loss: 0.6880 - val_loss: 0.6814
Epoch 8/20
52596/52596 [==============================] - 1s - loss: 0.6809 - val_loss: 0.6760
Epoch 9/20
52596/52596 [==============================] - 1s - loss: 0.6753 - val_loss: 0.6711
Epoch 10/20
52596/52596 [==============================] - 0s - loss: 0.6705 - val_loss: 0.6678
Epoch 11/20
52596/52596 [==============================] - 1s - loss: 0.6669 - val_loss: 0.6642
Epoch 12/20
52596/52596 [==============================] - 1s - loss: 0.6636 - val_loss: 0.6607
Epoch 13/20
52596/52596 [==============================] - 1s - loss: 0.6608 - val_loss: 0.6588
Epoch 14/20
52596/52596 [==============================] - 1s - loss: 0.6584 - val_loss: 0.6565
Epoch 15/20
52596/52596 [==============================] - 0s - loss: 0.6563 - val_loss: 0.6559
Epoch 16/20
52596/52596 [==============================] - 0s - loss: 0.6545 - val_loss: 0.6547
Epoch 17/20
52596/52596 [==============================] - 0s - loss: 0.6529 - val_loss: 0.6524
Epoch 18/20
52596/52596 [==============================] - 0s - loss: 0.6513 - val_loss: 0.6503
Epoch 19/20
52596/52596 [==============================] - 0s - loss: 0.6500 - val_loss: 0.6489
Epoch 20/20
52596/52596 [==============================] - 0s - loss: 0.6487 - val_loss: 0.6481
Out[13]:
<keras.callbacks.History at 0x1199eb6d8>

Your Turn!

Hands On - Keras Fully Connected

Take a couple of minutes and try to optimize the number of layers and the number of parameters in the layers to get the best results.


In [ ]:
model = Sequential()
model.add(Dense(100, input_shape=(dims,)))

# ...
# ...
# Play with it! Add as many layers as you want and try to get better results.

model.add(Dense(nb_classes))
model.add(Activation('softmax'))
model.compile(optimizer='sgd', loss='categorical_crossentropy')
model.fit(X_train, Y_train, validation_data = (X_val, Y_val), epochs=20, 
          batch_size=128, verbose=True)
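
As one possible starting point, here is an illustrative deeper configuration (the layer sizes are arbitrary choices, not tuned results); note the nonlinear activations between the Dense layers, without which the extra depth would still collapse to a single linear map:

model = Sequential()
model.add(Dense(256, input_shape=(dims,)))
model.add(Activation('relu'))
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))
model.compile(optimizer='sgd', loss='categorical_crossentropy')

# reuse the callbacks defined earlier so training stops when val_loss stops improving
model.fit(X_train, Y_train, validation_data=(X_val, Y_val), epochs=20,
          batch_size=128, verbose=True, callbacks=[best_model, early_stop])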

Building a question answering system, an image classification model, a Neural Turing Machine, a word2vec embedder or any other model is just as fast. The ideas behind deep learning are simple, so why should their implementation be painful?

Theoretical Motivations for depth

Much has been studied about the depth of neural nets. It has been shown, both mathematically[1] and empirically, that convolutional neural networks benefit from depth!

[1] - On the Expressive Power of Deep Learning: A Tensor Analysis - Cohen, et al 2015

Theoretical Motivations for depth

One much-quoted theorem about neural networks states that:

The universal approximation theorem[1] states that a feed-forward network with a single hidden layer containing a finite number of neurons (i.e., a multilayer perceptron) can approximate continuous functions on compact subsets of $\mathbb{R}^n$, under mild assumptions on the activation function. The theorem thus states that simple neural networks can represent a wide variety of interesting functions when given appropriate parameters; however, it does not touch upon the algorithmic learnability of those parameters.
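
In one common formulation of this result: for any continuous function $f$ on a compact set $K \subset \mathbb{R}^n$, any non-constant, bounded, continuous activation $\sigma$, and any $\varepsilon > 0$, there exist an integer $N$ and parameters $v_i, b_i \in \mathbb{R}$, $w_i \in \mathbb{R}^n$ such that

$$F(x) = \sum_{i=1}^{N} v_i \, \sigma\!\left(w_i^{\top} x + b_i\right) \quad \text{satisfies} \quad |F(x) - f(x)| < \varepsilon \ \text{ for all } x \in K.$$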

[1] - Approximation Capabilities of Multilayer Feedforward Networks - Kurt Hornik 1991